Parallelizing XML data-streaming workflows via MapReduce
Abstract
Prior work has shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm that views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to “black-box” (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
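The general idea of data parallelism over XML fragments can be illustrated with a small sketch. This is not one of the paper's compilation strategies and does not use Hadoop. It is a plain-Python illustration under stated assumptions: the document layout (`run` containing `sample` elements), the `black_box` function, and all function names are hypothetical. A map step applies the black-box function to each fragment independently, and a reduce step reassembles the updated fragments in their original order.

```python
import xml.etree.ElementTree as ET


def black_box(value: int) -> int:
    # Hypothetical stand-in for an opaque scientific function.
    return value * value


def map_step(indexed_fragment):
    # Each fragment is processed on its own, so this step could be
    # distributed across workers; here it simply runs in-process.
    index, fragment_xml = indexed_fragment
    elem = ET.fromstring(fragment_xml)
    elem.text = str(black_box(int(elem.text)))
    return index, ET.tostring(elem, encoding="unicode")


def reduce_step(root_tag, mapped):
    # Reassemble the updated fragments in original document order.
    root = ET.Element(root_tag)
    for _, fragment_xml in sorted(mapped, key=lambda pair: pair[0]):
        root.append(ET.fromstring(fragment_xml))
    return ET.tostring(root, encoding="unicode")


def run_pipeline(document_xml: str) -> str:
    root = ET.fromstring(document_xml)
    # Split the document into (position, serialized fragment) pairs.
    fragments = [
        (i, ET.tostring(child, encoding="unicode"))
        for i, child in enumerate(root)
    ]
    mapped = list(map(map_step, fragments))
    return reduce_step(root.tag, mapped)


if __name__ == "__main__":
    doc = '<run><sample id="a">2</sample><sample id="b">3</sample></run>'
    print(run_pipeline(doc))
```

Keying each fragment by its position lets the reduce step restore document order even if a real framework delivered the mapped fragments out of order.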
Similar resources
Parallelizing XML Processing Pipelines via MapReduce
We present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that consume XML-structured data and produce, often through calls to “black-box” functions, modified (i.e., updated) XML structures. Our main contributions are a set of strategies for...
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce
Processing XML queries over big XML data using MapReduce has been studied in recent years. However, the existing works focus on partitioning XML documents and distributing XML fragments into different compute nodes. This attempt may introduce high overhead in XML fragment transferring from one node to another during MapReduce execution. Motivated by the structural join based XML query processin...
Large Scale Machine Translation Architecture
Parallelization is widely considered to be the future of high performance computation, and is a natural choice when scaling up the machine translation systems. In this report, a programming model called MapReduce is investigated and two supporting components for MapReduce framework to work efficiently are analyzed, namely the distributed storage for streaming data and distributed storage for st...
Stubby: A Transformation-based Optimizer for MapReduce Workflows
There is a growing trend of performing analysis on large datasets using workflows composed of MapReduce jobs connected through producer-consumer relationships based on data. This trend has spurred the development of a number of interfaces—ranging from program-based to query-based interfaces—for generating MapReduce workflows. Studies have shown that the gap in performance can be quite large bet...
Parallelizing bioinformatics applications with MapReduce
Current bioinformatics applications require both management of huge amounts of data and heavy computation: fulfilling these requirements calls for simple ways to implement parallel computing. MapReduce is a general-purpose parallelization technology that appears to be particularly well adapted to this task. Here we report on its application, using its open source implementation Hadoop, to two r...
Journal title: J. Comput. Syst. Sci.
Volume 76
Publication date: 2010